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Abstract 

For proliferating cells subject to both division and death, how can one estimate the average 
generation number of the living population without continuous observation or a division-diluting 
dye? In this paper we provide a method for cell systems such that at each division there is an 
unlikely, heritable one-way label change that has no impact other than to serve as a distinguishing 
marker. If the probability of label change per cell generation can be determined and the proportion 
of labeled cells at a given time point can be measured, we establish that the average generation 
number of living cells can be estimated. Crucially, the estimator does not depend on knowledge 
of the statistics of cell cycle, death rates or total cell numbers. We validate the estimator and 
illustrate its features through comparison with published data and physiologically parameterized 
stochastic simulations, using it to suggest new experimental designs. 


1 Introduction 


Given a proliferating population of cells starting from one or more progenitors, a natural quantity 
of interest in cell biology is the average number of divisions per cell since the initial progenitors, i.e. 
the average generation of presently living cells. The average generation is related to the population’s 
turn-over rate and can potentially be used to quantify dynamics and aging of the immune system 
[Ml [281 m Ho] , to better understand the evolution and risk of cancer [1511121ES] , and to rank cell types 
in hierarchies of complex differentiation programs [13 Eg. 

Estimating average generation is a simple matter if cell lifetimes are all equal and the division time is 
known or, alternatively, if total cell counts can be measured and there is no cell death. If, however, 
lifetimes are heterogeneous, the population is subject to death as well as division or total cell counts 
are not available, the issue is more involved. In these settings, several experimental techniques can be 
employed to estimate average generation, including time lapse microscopy, division diluting fluorescent 
dyes, and inference from somatic mutations and telomere length. 

The most unambiguous measurement technique is in vitro time lapse microscopy as it affords nearly 
direct determination of lineage trees and cell generation. Time lapse microscopy has been used to 
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study many cell systems including bacteria, lymphocytes, embryonic development, gut development, 
to name but a few, e.g. m Eoi Ei HI im mi na na na Ea HH] • Even it, however, has limitations as 
filming is not continuous, cells can leave the field of vision or form three dimensional structures that 
inhibit tracking. The complexity of image segmentation increases with cell numbers, and so time lapse 
microscopy has proved challenging if more than approximately 10 generations are to be followed. 

Another popular experimental approach, particularly in immunology, is to stain starting cells with 
a fluorescent dye such as Carboxyfluorescein Succinimidyl Ester [39l |38l [23] , CellTrace"'"'^ Violet or 
Cell Proliferation Dye eFluor 670 [35]. With each division, cells inherit approximately half of their 
parent’s dye and so fluoresce with half their intensity. A cell’s generation can thus be determined 
by its luminous intensity via flow cytometry. This approach is used both in vitro and in vivo, and 
allows the experimenter to start with a large number of progenitors without difficulty. It enables 6-10 
generations to be followed before dye dilution is such that the signal-to-noise ratio is too high for a 
cell’s generation to be reliably determined. 

Determining generation in vivo remains challenging as often it cannot be achieved by direct observation 
or cell stain methods. Estimating replicative history, cell depth and lineage trees has been proposed by 
measurement of average telomere length mJlIllSEllTlllSSlIli] or by the number of somatic mutations, 
which are introduced during DNA duplication |S71IS71|SHl|7niEni|7]. While methods in this direction 
rely on inference rather than direct observation, they offer the possibility of tracing more than 10 
generations in a wide range of species, including humans. 

In the present paper, we provide a novel average generation estimator that is designed with the in 
vivo setting in mind. The estimator is suitable for systems where cells undergo an unlikely, heritable 
division-linked label change with a determinable probability, where the label serves only as a marker 
and does not impact on cell dynamics. Chief amongst the estimator’s desirable properties are that: 
(1) it requires no information on cell lifetime distributions; (2) the population can be subject to death 
as well as division; and (3) only a proportion of label-positive to total cells needs to be measured. The 
present article introduces the estimator, analytically establishes its fundamental properties, validates 
its applicability by comparison with simulation and comparison with published data, proposes an 
experimental realization, and demonstrates its utility. 

In Section[^we describe the estimator and explain how it can be used. The estimator appropriateness 
is a consequence of theorems that are presented in detail in Section with a heuristic explanation 
and overview of them appearing first in Section]^ 

In Section Validation Using Simulated Data, we use Monte Carlo simulations to assess the esti¬ 
mator’s performance for physiological parameterizations. In Section Validation Using Published 
Data, we use several sources of publicly available data to illustrate the estimator’s applicability. We 
avail of complete lineage tree data for the development of C. elegans, as determined from time-lapse 
microscopy |52|, stochastically labeling it. Applying the estimator results in accurate inference of 
the average generation in comparison to the directly observed quantity. We also take data from two 
distinct experimental studies, one on human colorectal cancer cells m and one on murine embryonic 
fibroblasts [321; that employ micro-satellite mutation fluorescent reporter systems. Micro-satellite 
mutation is an unlikely division linked change and the fluorescence of cells in these systems serves 
as a label suitable for average generation estimation. The results show consistency between average 
generation estimates, measured quantities and values reported in the literature. 

As an illustrative example, in Section]^ Experiment Design, we propose a genetic construct, based on 
existing pieces, to facilitate average generation inference. We describe how the probability of label-loss 


2 


could be measured and how the method could be validated. In Section]^ Discussion, we conclude 
with experimental designs where the method would prove biologically informative. 


2 Estimator Overview 


We consider a system where cells are subject to a division-linked, heritable label change that serves 
as a measurable distinguishing marker, but does not influence population dynamics. The method 
is appropriate regardless of whether initial cells are label-negative and can gain the label, which is 
then inherited by their offspring, or are label-positive and can lose the label, with their offspring not 
regaining the label. For a consistent description, we phrase the paper in terms of the latter, but the 
results all hold mutatis mutandis. 

During each cell’s lifetime, assume that a label-positive cell becomes lab el-negative with a known, 
small likelihood, p. Let Z{t) be the total live cell count at time t and be the live cell count of 

label-positive cells. Assume that the initial cells at t = 0 are all label-positive and that at some time 
t the fraction of label positive cells to total cells, Z'^(t)/Z{t), can be measured. With G{t) denoting 
the sum of the generations of all cells living at time t and with G(0) defined to be 0, then given there 
are label-positive cells in the system at t, i.e. Z~^{t) > 0, we establish that the average generation of 
cells alive at time t satisfies the following relationship 


z{t)~ p^^\z{t))- 


( 1 ) 


That is, the average generation of the population can be estimated directly from the proportion of 
label-positive cells if the delabeling probability is known. 

Perhaps unexpectedly the formula Q does not depend on several diflicult-to-measure factors such as 
cell-lifetime distributions and total cell counts. Moreover the right hand side of 0 requires only a 
proportional measurement of label-positive cells, which can be determined from a sample, and the 
relationship holds even though the population could be subject to death as well as division. 

For the validity of Q, we have assumed that at t = 0 all cells are label-positive. As a result 
^'*'(0) = Z{0) and so both the right and left hand side of 0 are 0 at t = 0, agreeing irrespective 
of the initial starting number. If, however, not all cells are initially label-positive, the estimator and 
the average generation would not agree. This can be rectified if measurements of the proportion of 
positively labeled cells are available at two time-points, t 2 > ti. Then, irrespective of the initial 
composition of label-positive and label-negative cells, so long as ^^(^ 2 ) > 0, 


G{t) 

m 


t 


t2 — ti 



f Z+it2)Z{h) \\ 

{z+{t,)Z{t2)j)- 


( 2 ) 


This two-measurement estimator has an additional advantage when initial cell numbers are small. If 
the culture is started, as an extreme example, with a single progenitor, then the right hand side of 
Q can be subject to stochastic fluctuations at shorter time-scales (see the Monte Carlo simulations 
in Section]^. As established rigorously in Section]^ with ti > 0, the two-sample estimator in ([^ is 
more accurate than Q as it removes the influence of the timing of early cell events on the estimate. 
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3 Explanation of the Estimator’s Origin 


The estimators Q and ([^ have non-obvious forms. Utilising expansion properties of cumulant gen¬ 
erating functions, results in the following section show that the relationships hold, in expectation, for 
an arbitrary familial relationship. This includes, in particular, estimation of average generation for 
heterogeneous cell populations with distinct, potentially interdependent, proliferation characteristics. 

In the absence of a generative model of family trees, however, that derivation cannot provide in¬ 
formation about the development in time of the average generation of a family of cells. Nor can it 
be used to determine time-dependent properties of the estimators on a developing population. For 
a more detailed analysis of sample-path, multi-progenitor and time-dependent properties, a general 
mathematical framework for capturing the stochastic dynamics of a proliferating cell system subject 
to division and death, as well as heterogeneous cell life times, is that of age dependent branching 
processes [5^ US] where cells have random lifetimes as well as random numbers of offspring. Since the 
seminal work of Bellman and Harris [5] , it has been known that if cells are more likely to divide then 
die, given the cell population does not die out, the number of cells living at time t grows exponentially 
in time, Z{t) ~ exp(at), at a rate, a, dubbed the Malthus parameter, that depends on the lifetime 
and offspring number distributions of cells. This result is known to be robust, for example, to sibling 
dependencies BMM- 

In a cell system experiencing heritable one-way lab el-changes, label-positive cells can become label 
negative-cells, but the reverse is not possible. Thus the number of label-positive cells at time t also 
satisfies the branching process result, ~ exp(Q!“''t), but with a Malthus parameter that is smaller 

than that for the total label-independent population of cells, < a. The difference a — a~^ depends 
on the likelihood, p, that a label-positive cell loses its label. 


Fewer results have been established regarding the total generation of a cell population, the sum of 
the generations of all cells living at time t, G{t), for which new theorems can be found in Section 
1^ For the label-independent population, in substantial generality we prove that G{t) ^ texp(at). 
That is, the total generation increases faster than the total population size by a factor of t. Recalling 
that Z{t) ~ exp(Q;t), for a general age-dependent branching process, the average generation of the 
population grows linearly in time G(t)/Z(t) ^ gt for some g > 0. In the following section we give 
a deterministic result to provide non-probabilistic intuition for why this is so before considering the 
stochastic system. 

The clinching result quantitatively relates the phenomena of stochastic delabeling to generational 
expansion. So long as label-positive cells remain 

log log(exp((a+ - a)t)) = (a+ - a)t ^ -pgt, 

so that for p small 


G{t) 

Z{t) 


--log 

P 



This identification holds for any age dependent branching process and does not depend upon the 
details of life-time distributions or the possibility of cell death, so long as populations do not go 
extinct. 


The estimators are related through stochastic quantities and are subject to sample-path fluctuations, 
particularly at early time-points. Establishing that estimates, as a function of time on individual 
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sample paths, converge to the true average generation underlies the development of the two time- 
point estimator in equation ([^. That methodology circumvents this issue of small number variability 
by eliminating the early fluctuations on a path-by-path basis. As a supporting result, we also prove 
that the path-to-path variability of estimates decreases inversely proportionally to the number of 
progenitors, supporting the precision of estimates for experimental systems that are initiated with 
multiple cells, which is typically the case. 


4 Formal Results 


4.1 In-expectation derivation 


We begin by deriving a version of eq. 0 based on averages over realisations of the delabeling process. 
This derivation has the advantage that it requires no assumptions regarding the family tree and so 
holds in complete generality, but it is not informative with regards individual realisations. Sample- 
path, time-dependent properties of the estimators arise as a consequence of involved theorems in 
the context of the most well established generative model of family trees, age dependent branching 
processes, which follow this derivation. 

At some time t, consider a collection of Z{t) cells, whose familial relationship need not be known. 
The generation of a cell is defined to be the number of ancestors it possesses, with initial cells being 
defined to be in generation 0. Denote each of the generations of the Z{t) cells by gi{t),... ,gz{t){t)- 
of the generations of all living cells being denoted by 

Z(t) 

Git) = J29^it), 

i=l 

knowing p we wish to estimate the average generation of presently living cells, i.e. G(t)/Z(t), by 
observation of the proportion Z^ {t)/Z{t). 


If initial cells are label positive and with each division the label is lost independently with probability 
p, the probability cell i is still label positive in generation g^it) is (1 — . With Z'^{t) denoting 

the number of label positive cells, by linearity of expectation the average proportion of label-positive 
cells in the population is 


EjZ+jt)) 

Z{t) 


Z{t) 


Z(t) 

Ed 


p)9di). 


Identifying 9 = log(I —p), this can be re-written in a form that identifies it with a moment generating 
function, e.g. 


. Z{t) 

E (e®^) = ^ (^9^it)), 


where T is uniformly selected in { 5 i(t),... ,gz{t)it)}- As both the moment generating function and 
the cumulant generating function, logi?(e®^), are real analytic, we can take a Taylor expansion of the 
latter around the origin, giving 


log (T;(e®^)) = 0 + eE{T) + 0(9^), 
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so that 


lim Jlog(£;(e®r)) =i?(r). 


Taking 0 —>■ 0 is equivalent to taking p —)■ 0. Thus noting that limp_j.op/ log(l — p) = —1, we obtain 


lim — log 

p-j-O p 


( E{Z+{t)) \ 
\ Z{t) ) 


G{t) 

z{ty 


which is formula Q, albeit with an expectation over stochastic delabelings. While this result is not 
as strong as others we shall prove, it illustrates both how the unusual formulation arises and that, 
averaged over delabeling processes, the relationship holds for arbitrary family tree structure. 


4.2 Sample path properties 

To determine sample path properties of the generation counting estimator, we must establish two 
quantitative relationships in a general age-dependent branching process: (1) relating the growth rates 
of labeled and unlabeled populations when the probability of delabeling per cell division is small; and 
(2) for the relationship between the number of cells alive and the sum of the generations of all living 
cells. 

For the quantitative relationship between the growth-rates of the labeled and unlabeled populations 
when the delabeling probability per cell division is small and label-state does not impact on population 
dynamics, we leverage well-know single type results mii. For the relationship between the number 
of cells alive and the sum of the generations of all living cells, there are some results indirectly available 
from |55j . but to identify more properties of our estimator we find it beneficial to provide an analysis 
of the joint probability generating function of number of cells alive and their total generation number. 

Our initial mathematical study treats systems that start with a single initiating progenitor. From 
these results, systems that start with several statistically equivalent progenitors with indistinguishable 
progeny can be deduced. In addition, we provide results on the behavior of the estimator for a system 
with numerous progenitors, establishing that that variances behave inversely proportionally to the 
number of progenitors. 


4.3 Smoothness of the Malthus parameter 

We begin by considering a standard age dependent branching process, e.g. [22ll29], with the usual 
assumptions on the distribution of a lifetime, a non-negative random variable r, and the number 
of offspring, N a non-negative integer valued random variable, at the end of a lifetime. In the 
circumstances of interest to us, proliferating cell populations, N will take values in {0, 2} indicating 
that cells die or divide into two at the end of their lives, but we will prove the results in greater 
generality. 

Assumption 1. The lifetime distribution, P{t < t) fort € [0,oo), is non-lattice and satisfies P{t < 
0“*") = 0. The probability generating function of the number of offspring, pn{s) = E{s^) for s gM., is 
finite in a neighborhood of 1. We denote its expected value by h = E{N) = -^pn{s)\s=i- 
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A key quantity in the study of age-dependent branching processes is the Malthus parameter, a{h), 
which is defined to be the solution of the following equation, should the solution exist. 




( 3 ) 


If a{h) exists, it is then the, possibly negative, asymptotic exponential growth rate of the expected 
population size. The dependence of the Malthus parameter, a{h), on the expected number of offspring 
of a cell, h, is not usually made explicit, but will prove essential for us as we shall be interested in 
relating the growth rate of the label-positive population and of the total population, which will differ. 
For that purpose, we have the following result on the range of mean offspring, h, for which the Malthus 
parameter exists and its smoothness as a function of h. We expect that this Proposition is known, 
but cannot find a reference in the literature and so present a proof here. 


Proposition 1. Define 

fimax ■= sup {/? : E{e^'^) < 00 } and h^tn ■= inf{h : > 1}, 

where hmin < then there exists a real analytic function a : {hmin,oo) 1 —M, the Malthus parameter, 
such that 

hE{e~°‘^^'>'") = l for h& 

In particular, for h > the first two derivatives of a(h) as a function of the average number of 

offspring h are 


a'{h) 



1 


( 4 ) 


and 


a"{h) := 



1 / 

h^E{Te-°^W'^) ) ' 


( 5 ) 


Proof. Consider the function g : (0, 00) x M >->• [0, 00] defined by 

g{h,fi) = hE{e-P-)-l. ( 6 ) 

We wish to identify the range of h such that a{h) exists satisfying g{h, a{h)) = 0, and to determine 
its smoothness properties as a function of h. When /3 = 0, E{e~^'^) = 1 and so hmin < 1- As E{e~^'^) 
is the moment generating function of a non-negative random variable, r, evaluated at —fi, it is a 
real analytic function on the interior of the domain on which it is finite, i.e. for fi € (—fimax, 00). 
Therefore as g(h, /?) defined in eq. ®> is a product of real analytic functions, it is also real analytic on 
(0, 00) X (—finiax, 00). As g{h, fi) is a monotonic decreasing function of j 3 , for any h € (hmim 00) there 
exists a unique fi such that g{h, j 3 ) = 0. Thus we can apply the real analytic version of the Implicit 
Function Theorem, e.g. Theorem 1.8.3 [33]. This establishes that a{h) satisfying eq. ([^ exists for 
h G (hmin,00), that a is real analytic for h > hmin, and that its first derivative satisfies 

d / d 

a'ih) = -^5(^,^)l(/i.a(?*)) / 

which gives eq. Q. To obtain eq. ([^, one differentiates eq. Q. □ 
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The real analyticity of a{h) allows us to characterize the difference in growth rates of two populations 
in terms of the Taylor expansion of a{h). This will be useful as if there is small probability of label- 
loss, then the label-positive and total populations have similar, but non-identical, average offspring 
number. We will relate the difference in their growth rates to the dynamics of the average generation 
of the population. 


4.4 One way labeling populations 

We consider a two-type reducible age-dependent branching process previously studied, for example, in 
[26]. That is, we consider a cell system that starts with a single progenitor that is positive for a label so 
that = 1 and Z~(0) — 0, and define the total population at time t to be Z{t) = Z^{t) -\- Z~{t). 

Each cell is assumed to have an independent and identically distributed lifetime at the end of which 
they independently give rise to a random number of offspring. Positive label cells can become negative 
label cells, but the reverse does not happen. We are interested in populations where the label does 
not indicate a phenotypic change so that lifetime distributions do not depend on the label. 

Depending upon the experimental system and marker employed, the process of delabeling can occur 
in different ways and so one may have distinct models. Thus we make a general assumption that 
encompasses several of these. Notationally, let the random variable iV+(p) define the number of label¬ 
positive offspring of a label-positive cell, where p S (0,1) captures the probability of delabeling. We 
will make a technical assumption that ensures that the distribution of the number of label-positive 
offspring of a label-positive cell is less than its total (i.e. label positive and negative) offspring and 
that as p drops to 0 we assume that we recover the random variable describing the total (i.e. label 
independent) offspring of a cell, denoted N := 7V+(0). 

Assumption 2. Lifetimes are independent and identically distributed, and independent of independent 
and identically distributed type-dependent offspring numbers. The lifetime distribution for cells, P{t < 
t) for t S [0, oo), is non-lattice and satisfies P{t < O’*') = 0. If p > 0, the number of label-positive 
offspring of a label-positive cell is less than its total number of offspring: N^(p) is strictly stochastically 
dominated by N = N'^(0), i.e. P{N^{p) < n) > P{N < n) for all n € {0,1,...} and P{N^{p) < 
n) > P{N < n) for some n. Thus, defining E(iV+(p)) = /i+(p), h^{p) < E{N) = h for all p > 0. We 
assume that pn{s) = E{s^) for s G M., is finite in a neighborhood of 1, that linip^o ^+(p) = h, and 
that /i_|_(p) is real analytic in p. 

Example 1: delabeling occurs immediately prior to a cell’s division, independently with probability p. 
In this case the offspring of a label-positive cell’s are either all labeled or all delabeled. Thus with N 
denoting the label-independent offspring random variable, 

E =p^(l- p)E (s^) =p+{l- p)pn{s) 

and, in particular, h+{p) = (1 — p)h. 

Example 2: delabeling of the offspring of a label-positive cell occurs independently with probability p 
at birth. In this case 

E H H (™) (1 - 

n>0 m>n ^ 

so that again h+{p) = (I — p)h, but higher order moments are smaller than those in Example 1. 


Example 3: the offspring of a label-positive cell experience asymmetric label-loss. There are genetic 
constructs where, on division, labels are lost asymmetrically m- That is, if a label-positive cell 
divides in two and one of its two offspring loses its label, then the other does not. As a result, if p is 
the probability of delabeling and q is the probability that a division rather than death occurs, then 

E =q(^sp + 5^(1 - p)) , 

E{N) = h = 2q and the mean number of label positive offspring is E{N~^{p)) = h+{p) = (1 — p/2)h. 

In all three of these examples h+{p) is a linear function of p, so that Assumption]^ holds. Based on 
results in EHE] in conjunction with Proposition]^ the following theorem can be deduced. 

Theorem 1. If h^{p) > 1, under Assumption^ if (t) > 0, then almost surely 

= p/i^(0)a'(/i) -h {h'j^{Qfa”(h) + a'(h)h" (0)) -h O(p^), (7) 

where 

h+{0) = ^h+{p)\p^o and h" (0) = ^h+{p)\p=o, 
and, in particular, z/limt_>oo Z'^{t) > 0, then almost surely 

I'lSlhi ^ ‘“'5 (^) = KiOWihl (8) 

Proof. The asymptotic behavior of Z~^{t) and Z(t) follow directly from [^[3]. By Proposition]^ 
a{h) is real analytic for h > hmin and 1 > hmin, and, by Assumption]^ ^-i-(p) is real analytic with 
limp^o h+{p) = h. Thus a(/i+(p)) is real analytic and we can take a Taylor expansion of a(h+(p)) in p 
around 0 and so eq. Q holds. The relationship in Q follows from eq. Q taking the limit as p tends 
to 0. □ 

For comparison with these two-population results, we need to establish properties of the total gener¬ 
ation, the sum of the generations of living cells, for a single-type population. 


4.5 Total generation of a branching process 

In order to precisely define the total generation of a branching process, some unwieldy definitions are 
necessary. These shall be largely side-stepped in the analysis, but are included to precisely define the 
objects of interest to us. We use a modification of ideas in [22] to do so and consider a single-type 
construction as that is sufhcient for our needs. 

The generation of an object is defined to be the number of ancestors it possesses. As such, it is natural 

(i) 

to begin with a single object at time 0 defined to be in generation 0. Let rj denote the length of the 
life of the j-th object in the f-th generation and let denote the number of offspring of the j-th 


9 




object in the i-th generation. Starting with a single object, the total number of objects that will, at 
any stage, exist in the n-th generation is governed by a Galton-Watson recursion: 

Kr, 

Kq = 1 and Kn+i = ^ for n > 1, 

k=l 

where the empty sum is defined to be 0. That is, the number of objects that will ever be in generation 
1 is the offspring number of the hrst object, Ki = The number of offspring that will ever exist 

in generation 2 is the sum of the number offspring of those members of generation 1, and so forth, 
leading to the above recursion for Kn- 

Each object in generation n is defined by a life-history, < ... ,in >, where this object is the i„-th 

child of the i„_i-child of the i„_ 2 -th child and so forth back to iq = 0. The indices run over objects 
that exist, so that € {1,..., Kn\. The birth time of the object < io, ■ • ■, in > is the sum of lifetime 
of its ancestors. 


n—1 

fc =0 

The time the object ceases to be is In terms of these definitions, the total number 

of objects present at time t can be identified as all objects of any generation that are alive at time t, 

OO 

n—0 <io,...,in> 

where x is the indicator function. That is, the population alive at time t is all those objects whose 
birth time is before or equal to t and whose death time is after t. 

The generation of an object < iQ,...,in > is n and thus the sum of the generations of all objects 
existing at time t is 


G{t) = ^n Y. ( 10 ) 

n—0 

That is, for each object alive at time t in generation n, the total generation at that time is incremented 
by n. 

Before considering the probabilistic system, we begin with a fundamental, non-probabilistic lemma 
that will give an indication as to why one expects that the total generation of all living cells G to 
grow in a similar fashion to Z log Z, where Z is the number of presently living cells, assuming that 
cells can either die or divide into two at the end of their lives. 

Lemma 1. For a binary family tree beginning with a single progenitor in generation 0, if there are 
Z > 0 cells alive then the average generation per cell satisfies GjZ > log 2 Z. 

Proof. Consider a family tree with Z = 2"^ + m cells alive (i.e. Z leaves), where n € {0,1,...} and 
m € {0,1,..., 2"+^ — 2”}. The minimal G given Z is attained is when 2” — m cells are in generation 
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n and 2m cells are in generation n + 1, which can be formally established using the equations given 
above. Thus 


Z log 2 Z = (2” + m) log2(2” + m) and G = (2” — m)n + 2m{n + 1). 

If m = 0 or TO = 2", these are equal and so the difference between the two is zero at each 2" and one 
needs to be delicate with one’s estimates. A rearrangement of terms gives .^log 2 Z < G if and only if 

(2" + to) log2(l + to / 2 ") < 2m. 

A sufficient condition for this to be true is to relax the problem, letting to = a;2" with x G [0,1], and 
establish that 


(1 + x) log 2 (l +x) < 2x for all x G [0,1], 
which is readily achieved by calculus. 


□ 


In a system with death, in general there is no equivalent upper bound to Lemma[2as it’s possible that 
there is only one cell alive, Z = 1, and the generation of that cell, and hence G, can be arbitrarily 
large. For branching processes, however, under general conditions the population either dies out or 
ultimately grows in a somewhat regular fashion, |22| . and so Lemmaanticipates that, in a suitable 
sense, G(t)/(tZ(t)) becomes constant if the population survives. Indeed, amongst other results, this 
is something we shall establish. 


It is not possible to study the total generation of living cells, G{t), separately from the total population, 
Z(t), as they are strongly coupled and their dynamics are intertwined. This relationship appears as 
a cross term in eq. (12) when we consider the evolution of the joint probability generating function 
of the two, giving rise to an integral equation of unusual form. Despite this form, it is, in general, 
susceptible to analysis. 


Assumption 3. Lifetimes are independent and identically distributed and independent of independent 
and identically distributed offspring numbers. The lifetime distribution, P{t < t) fort G [0, oo), is 
non-lattice and satisfies P{t < O’*') = 0. The probability generating function of the number of offspring, 
Pn{s) = E{s^) for s G K, is finite in a neighborhood of 1. We denote its first and second derivative 
at one by 

h = E{N) = ^pNis)\s=i and v = E{N‘^) -h = ^pn[s)\s=i. (11) 

Theorem 2. Under Assumption^ starting with (Z(0),G(0)) = (1,0), the joint probability generating 
function of {{G{t), Zif))}, 


PG,z{sg,s„,t) := E , 


satisfies the integral equation 

PG,z{Sg,S:,,t) = P{t > t)s^ / dP{T <u)pNipG,ziSg:SgS:,,t-u)). 

Jo 

For h > hmin, define 


c{h) = 


a'{h){h — 1) 
a(h) 


( 12 ) 


(13) 
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where a{h) and a'{h) are defined in the statement of Proposition^ For h > hmin, 


and 


lim — ^ = c{h)ha'{h), 


lim = ha'{h). 


t-^-oo tE(Z(t)) 


If h> 1, with V defined in eg. (11), defining 


K = 


vE{e 


— 2o!T \ 


1 - 


( 14 ) 


(15) 


(16) 


in addition we have that 

*->■00 tE{Z{t))^ 


= ha'{h)K and lim 


E{Gm 


t->oo t‘^E{Z{t)y 


= {ha'{h))^ K. 


Proof. We first establish that eq. (12) holds. Begin with a family tree starting with a single cell at 


time 0, Z(0) = 1, defined to be in generation 0, so that G{0) = 0. Consider the impact of altering this 
initial condition so that the founding cell is in generation 1 at t = 0, G'(O) = 1. We will write Gi{t) for 
the process counting the total generation of living cells starting with G(0) = 1, so that, in particular, 
Gi(0) = 1, and retain G{t) for total generation given G(0) = 0. The key observation, which can be 
established using equations eq. (10) and eq. (|^, is that 


OO 

= ^(^ + 1 ) ^ X < t < = G{t) + Z{t). 


n—1 


E X 


This can be understood by noting that, leaving all other aspects of the family tree unchanged, the 
move from G(0) = 0 to G(0) = 1 causes the generation of every living cell to increase by 1. 

Thus, taking expectations over the lifetime of the initial cell and using the independence and identical 
distribution of the sub-trees generated by the first birth, but noting they start with cells that are in 
generation 1, we obtain 


PG,zisg, s^fi) = P{t > t)s^ 
= > t)sz 

= P{t > t)sz 
= P(t > t)Sz 


[ dP{T< 

Jo 

f dP{T< 

Jo 

f dP{T< 

Jo 

f dP{T< 

Jo 


u)pn 

u)Pn 

u)pn 

u)pn 


{e {sg^dt-n) 

{e (^Sg°d-u) + Z(t-u) ^^Z{t-u)y^ 

{PG,z{Sg,SgS^fi-U)), 


giving eq. (12). The uniqueness of pG,z[sg, Szfi) for Sg < 1 and Sz 
arguments to those in [^, Chapter VI, utilizing the ideas in 


< 1 follows from analogous 
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To establish eq. (141, note that using eq. (12) 


E{G{t)) = = [ dP{T<u)hE{G{t-u))+ f dP{T <u)hE{Z{t-u)) 

OSg Jq Jq 

dP{T < u)hE{G{t — u)) + E{Z{t)) — P{t > t), 

where we have used the equivalent, well known integral equation for E{Z(t)) to obtain the final 
equality. We wish to study the asymptotics of this equation, dividing it by te7rp{a{h)t) and taking 
limits as t —> oo. To this end, we employ a result from renewal theory, [2] Theorem 6.2(a), using the 
fact that by construction hexjp(—a{h)t)dP{T < t) is a probability measure, 



lim 


EjGjt)) 

te^Wt 


1 


( lim - lim > t)] , 

\t—¥(X> t—^oo / 


should the limits on the right hand side exist. The existence of the first limit follows from results in 

HU, 


lim = c{h). (17) 

t—¥00 

Consider > t). If a{h) > 0, then the limit automatically exists and is 0. If a{h) < 0, select 

P > —a{h) > 0 such that P(exp(,ST)) < oo. From Proposition]^ we are guaranteed the existence of 
such a P as h > hmin- As /3 > 0, as in the Chernoff bound we have that 

P(r > t) = E (!{,->(}) < e~^*E{e^'^) and hence limsupe^*P(T > t) < E{e^'^). 

t—¥oo 


As P > —a{h), this ensures that 
Hence, using eq. (|^, 


lim > t) = 0 . 

t—^OO 


lim = c{h)ha' (h), 


t—^Qo i;.^Gi(h)t 

as required. To obtain eq. (151, one uses eq. 0 in conjunction with eq. ( |14[ ). 
Consider now, with a little re-arranging of terms. 


E{Git)Zit)) = 


52 


dszdsg 


PG,Z (^ 3 ; t)|s^—— 


= f dP(T < u)hE{G{t — u)Z{t — u)) + f dP{T < u)hE[Z(t — uY) 

Jo Jo 

+ f dP{T < u)vE{Z(t — u)Y + f dP(T < u)vE{G{t — u))E{Z{t — u)). 
Jo Jo 


We wish to establish that the limit 


lim 

t—^OO 


E{G{t)Z{t)) 

^g2a(/i)t 


( 18 ) 
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exists and identify its limit. Using eq. ( [T7| and 

lim = c{hfK, 

t—^OO 

which can be found in [^, it is clear that only the last term on the right hand side of could be 
non-zero. As < t) is a defective probability measure, that is < t) < 1, 

an application of the Renewal Theorem for Defective Measures m in conjunction with the Dominated 
Convergence Theorem leads us to 


E{G{t)Z{t)) 


1 _ /j^^g-2a(/t)T^ t^ao Jg y — J ^^a{h}(t-u} ga(h){t-u) 

-2a(h)T\ 

=v -— 7 — ,,, c(h)^ha'(h) 


A similar argument reveals the other equality. 


□ 


From eq. ([^ and eq. (15), in Theorems and respectively, we have our cornerstone result relating 
the average generation of the population to the proportion of positively labeled cells, justifying ([^ 
and ([^. 

Proposition 2. Under Assumptions^ and^ ?/limt_,.oo (t) > 0, then 

lim = ha'{h) = lim lim - log { 

t^oatE{Z(t)) p-J-O t->oo pt Z{t) ) \h^(0) 


almost surely. 

Consider the delabeling possibilities described after Assumption In Examples 1 and 2, where 
delabeling occurs with probability p at the end of a lifetime or independently with probability p for 
each offspring, h/h'j^{Q) = —1 and we have that 

t^ootE{Z{t)) p^ot^oopt y Z{t) J 

almost surely, if limt_>oo Z~^{t) > 0. Whereas, as in Example 3, if delabeling occurs via an asymmetric 
division construct, h/h'^{Q) = —2 so that 

E{G{t)) 1, fZ+{t)\ 

t^aotE[Z{t)) p-J-O t->oo pt y Z{t) J 

almost surely if limt_,,oo Z^{t) > 0. 

We have proved results for E{G{t))/E{Z{t)), experimentally we would wish to know E{G{t)/Z{t)), 
the average generation per progenitor. Thus we wish to establish that these results are a good 
approximation for the latter. To achieve this we use the standard approach, e.g. [22], of taking a 
Taylor expansion of the ratio around the ratio of the expectations, and quantifying how the expectation 
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of its leading order terms behave. Using the first and second order terms of the Taylor expansion, we 
have 


E 



\Z{t) > 0 


mm (.. E{z{t?) \ _ E[G{t)z{t)) 

E{Zit)) { ^ E{Z{t)Y) E{ZitW 


Using the results in Theorem if h > h^in 

E{zm) 


t^oo t \E{Z(t)) 


1 + 


E{Z{t)f 


E{G{t)Z{t)) \ 
E{Z{t)Y j 


= ha'(h) = lim 


E{Git)) 


t-^oo tE{Z{t)) 


Thus using the ratio of the expectations as an approximation to the expectation of the ratio is 
reasonable. 


We can also use the Taylor approximation to determine the approximate behavior of the variance of 
G{t)/Z(t). Namely, to first order 

E{Gm) 2E{G{t)Z{t))E{G{t)) E{G{t)fE{Z{tY) 

^ E{Z{t)Y E{Z{t)f ^ E{Z{t)Y ■ 

Using the results in Theorem if h > 1, this shows that the variance is becoming small, 

( E{G{tf) 2E{G(t)Z{t)) E{G{t)) ( E{G{t)) \ m{Z{tf) \ 

t^oo tE{Z{t)Y tE{Z{t)) \tE{Z{t)) J E(Z{tW) 

Thus we expect the average generation of a population of cells to be a stochastically well behaved 
process; a phenomenon we will observe in simulations described later. 


4.6 Large numbers of progenitors 


Cultures are typically started with more than a single progenitor and so one expects that the esti¬ 
mator’s accuracy will improve by laws of large numbers. Here, by means that have nothing to do 
with branching processes, we establish that this is indeed the case and that, in particular, the ratios 
are both Asymptotically Normal, e.g. [56) . and the variance decreases as one over the number of 
progenitors. 

In order to state the result, recall that a sequence of random variables {X^} is said to be Asymptoti¬ 
cally Normal, e.g. [SS], with means and variances {cr^} if > 0 for all n sufficiently large and 
the sequence {{Xn — ^J,n)/o'n)} converges in distribution to the Gaussian distribution with 

mean 0 and variance 1. 


Theorem 3. Let {(Z'^i(t), Zi{t))} be a bivariate i.i.d sequence of possibly correlated random variables, 
representing the label-positive and total cell population at time t from progenitor i, and let {Z'^, Z) be 
an independent copy. If the probability generating function of {Z~^, Z) is finite in a neighbourhood of 
(1,1), then 


--log 

P 
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is Asymptotically Normal with 

1 E{Z+) ^2 11 fVar{Z+) ^Cov{Z+,Z) Var{Z)\ 

Hn -log E{Z+y ~ ^E{Z+)E{Z) ^ E{Z)^ ) ’ 

where Cov{X,Y) = E{{X — E{X)){Y — E{Y))) and Var{X) = Cov{X,X). 


(19) 


Let {{Gi{t), Zi{t))} be a bivariate i.i.d sequence of possibly correlated random variables, representing 
the total generation and total cell population at time t from progenitor i, and let (G, Z) be an indepen¬ 
dent copy. If the probability generating function of (G, Z) is finite in a neighbourhood of (1,1), then 
X)r=i ^i(0/(X)r=i *5 Asymptotically Normal with 


E{G) , ^_l(Var{G) ^Cov{G,Z)E{G) ^ Var{Z)E{Gf\ 
E{Z) n\ E{ZY E{ZY E{Z)^ ) ' 


( 20 ) 


Proof. Both results start with an application of the Multi-variate Central Limit Theorem, e.g. [5S] . 
to the partial sums of {{Z^i{t), Zi{t))} and {{Gi{t), Zi{t))} to establish their Asymptotic Normality. 
Application of the Delta Method, the corollary on page 124 of m, with the function g{x,y) = x/y 
then establishes Asymptotic Normality of the ratios of the sums. An additional application of the 
Delta Method with the function g{x) = —p~^ log(a;) establishes the Asymptotic Normality of the 
logarithm of the ratio of sums. □ 


As a result of this theorem, we anticipate the variability in the average generation number, as well as 
the variability in estimator, to decrease significantly as the number of progenitors increases. 


5 Validation Using Simulated Data 

Here we use Monte Carlo simulations of age dependent branching processes [HI |H] to validate the 
method, investigating its appropriateness on individual stochastic sample paths for parameter values 
of the order one would anticipate experimentally. 

Define Z'^{t) to be the number of label-positive live cells at time t, Z{t) to be the total number 
of living cells at time t, G{t) to be the sum of the generations of all living cells at time t, so that 
G{t)/Z{t) is the average generation of living cells at time t. In this section each cell’s lifetime is drawn 
independently from the same distribution and, at the end of their life, each cell independently either 
dies or gives rise to two cells. We assume that label-positive cells delabel immediately prior to division 
with probability p. 

One of the estimator’s primary features is that it can be used in practice without knowledge of 
the lifetime distribution or death rates. Thus we performed stochastic simulations over a range of 
conditions, including lifetime distributions such as lag-exponential, lag-log-normal and lag-gamma, 
and parameter values such as p G (10“®, 10“^), t G [0, 300] hours and the average number of offspring 
of a cell, E{N) = h, in (0, 2). These values include those typically encountered in cell cycle experiments 
[mlEll EH [Ml EQlin] and demonstrated the merits of the method, even for t being relatively small, 
at a few days, and p of order 0.01 per cell per generation. To illustrate the insights gained from 
the simulation study, we report on a specific, representative example, where lifetime distribution is 
lognormal, which is found to be a good fit in [H]- Figures]^ andare the equivalents of[^and[^ but 
for delayed-gamma distributed cell lifetimes. The results for both of these settings, as well as others 
that we have explored, are qualitatively similar. 
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5.1 Simulation parameterization 


For the lifetime distribution we assume a log-normal distribution with parameter values based on data 
from long-term video microscopy of murine B lymphocytes proliferating in response to CpG DNA |24| , 
T is log-normally distributed with mean 9.3 hours and standard deviation cr = 2.54 hours. We consider 
populations that are, on average, expanding. 

For the probability of delabeling, we adopt estimates related to existing division-linked labeling sys¬ 
tems, namely Cre-induced mitotic recombination [651136] and micro-satellite mutation induced label 
activation [591130| . The probability of Cre induced mitotic recombination has been estimated for a 
specific construct as 10“^ per cell per generation [65], while mutation rates in human micro-satellites 
were inferred to be in the (10“"^, 10“^) range per locus per generation [B^ . 

Cell populations are simulated for T = 250 hours. At the end of a cell’s lifetime, it divides into two 
cells with probability 0.8 and dies with probability 0.2, giving an average number of offspring h — 1.6. 
This leads to an average expansion by a factor of approximately 10® at 250 hours and an average 
increase in the generation number of more than 25. Note that 25 generations is far beyond what 
current experimental method, such as continuous observation or division-diluting dyes, are able to 
measure. 

The mathematical results hold for sample population ratios where label-positive cells persist. If at 
some time t S (0,T] Z^{t) — 0, we do not report the path from t on as this represent samples for 
which the experimenter would have no estimate beyond that time. 


5.2 Single progenitors: generation consistency, label ratio variability at 
early times 

Systems that start with a single label-positive cell are subject to the greatest variability at early 
times and thus this situation is the one that poses the greatest difficulties for estimation. While most 
experimental systems will start with more than a single progenitor and we know from the mathematical 
results in Section |4.6| that they are better behaved, with variance in estimates decreasing inversely 
proportional to starting cell number, we begin by considering this most challenging circumstance. 

Starting with a single label-positive progenitor at Z'^{0) = 1, panels (a) and (b) of Fig.[^plot 15 realiza¬ 
tions of both the average generation G'(t)/2'(t) (orange lines) and the estimate of it, —l/plog{Z~^{t)/Z(t)) 
(blue lines). The average generation across realizations is consistent, even for early times, and grows 
linearly in time as Theorem [^predicts. 

In contrast, the per-realization estimates, the blue lines, which the sample path theory establishes 
will ultimately grow with the same slope as the average orange line, exhibit substantial variability at 
short time-scales. Two dynamical properties contribute to these fluctuations. At the beginning, there 
are no label-negative cells and so the logarithm of the ratio is 0. When the first label-negative cells 
appear, the ratio of label-positive to total cell population can dramatically change, especially if the 
population size is small. In effect, the theorem is not in force until both the label positive and label 
negative populations are large enough for average behavior to become dominant, which only happens 
after at least an order of 1/p division events have occured. As one might expect, by decreasing p from 
10“^ to I0“® this effect is amplified, as can also be seen in Fig. [^(a) and (b). 

In this setting, the typical behavior is initially for underestimation as can be seen in the inter-quartile 
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range plot of panels (c) and (d) of Fig. for 10^ simulations. This can be understood as there is no 
estimate prior to at least one cell delabeling, which requires, on average, 1/p divisions to occur. As p 
is assumed small, typically one does not intially get an estimate and this delay explains the observed 
lag in the blue lines. On the other hand, in the unlikely event that an early cell delabels, then the 
approximation eq. 0 initially gives signifcant over-estimates. This bias, which we have observed 
consistently across other simulations, suggests that if starting with a single progenitor family, eq. Q 
only becomes accurate at longer time-frames, i.e. after several generations have passed. Note that the 
mean value of the estimator at early times over multiple runs is effected by outliers, hence the mean 
lies outside the inter-quartile range. 


5.3 Single progenitors: mitigating variability by nse of two time-points 


Results in Section|^for the time-dependent sample paths of estimates suggest that one way to mitigate 
this initial variability in the single time-point estimate —l/p\og{Z~^{t)/Z{t)) is to make two measure¬ 
ments at distinct times, ti < t 2 and use their difference to estimate G(t)/Z(t) via the approximation 
eq. 0 in Section 


G{t) 

m 


t 


t2 — ti 



f Z+{t2)Z{h) \\ 

{z+{t,)Zit 2 )))- 


The effect of this difference measurement is the removal of the initial fluctuations on a path-by-path 
basis. This is illustrated in Fig. panels (e) and (f). Using Monte Carlo methods and a standard 
kernel estimation method, the plots compare three values: the actual average generation G{t 2 )/Z{t 2 ) 
at t 2 = 250 hours (orange); the single time-point estimates —l/p\og{Z^[t 2 )/Z{t 2 )) (blue); and, with 
t 2 = 250 and an extra measurement at ti = 200 hours, the two time point estimates in eq. 0 with 
t = t 2 = 250 hours. The two-point estimate is not only more symmetrically distributed around the 
true value, but its variability is significantly less than the single time-point estimates. 


5.4 Multiple progenitors: improved consistency 

While the method works well for single progenitors, experiments are typically seeded with multiple 
progenitors. Theorem establishes that the variability decreases inverse linearly with the number of 
progenitors, Z(0), and so one expects that starting with even a small number of cells will eradicate the 
early time variability and here we demonstrate that this is, indeed, the case. For Z'^(0) = 100, Fig.|^ 
panels (a) and (b) correspond with those of Fig. The orange lines, showing the per-realization of 
the average generation, remain largely unchanged, but the accuracy of the single time point estimates 
(blue) are greatly improved. 

This comes about as even if an individual progenitor’s family generates label-negative cells relatively 
late this is balanced by another progenitor’s family where delabeling happened relatively early. Thus 
both Z'^{t) and Z{t) rapidly become sufficiently substantial for the limit theorem to be in force and 
for the estimator to be precise. 

Fig. [^panels (c) and (d) show Monte Carlo created Box plots of the per-sample average generation, the 
single time-point estimate and the two time-point estimate for a range of initial progenitor numbers. 
Sample-to-sample variation is greatly reduced even with as few as 10 progenitors. Thus in standard 
experimental systems, which are often seeded with hundreds or thousands of cells, the estimator is 
expected to be precise. 
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6 Validation Using Published Data 


The simulations of the previous section show that the estimator works well for parameterizations akin 
to those found in experimental systems and provide insight into the impact of a number of factors, 
such as the delabeling probability and the number of progenitors, on the quality of the estimates. 
Here we use a range of published data to further investigate the estimator’s applicability. 


6.1 Early development of C. elegans 

The first application takes lineage trees of the early development of C. elegans determined by time- 
lapse microscopy and reported on in |52j . As the entire family tree is known, we have direct access to 
measurements of average generation as a function of time. Simulating a stochastic delabeling process 
on this tree, as would be experimentally possible via a division linked mosaic construct similar to the 
one introduced in m or the one we propose in Section we can directly investigate the estimator’s 
accuracy. 

Fig. i (a) shows the embryonic lineage tree of the nematode Caenorhabditis elegans as constructed 
from data published by Richards et al. [S2]. In their study, a reference lineage tree was constructed 
based on output from automated tracking of nuclei in 18 embryos, recorded over approximately six 
hours at 1.5 minute intervals by three-dimensional resonance-scanning confocal microscopy. Each node 
in the figure represents a cell division and the tree contains information about all parent-daughter 
relationships and gives the timing of more than 10^ division events during an embryo’s development. 
This leads to a population size of approximately 600 cells. Fig. [^(b). From the timing of the division 
events, lifetimes are readily computed, shown in Fig. [^(c). 

For this tree, lifetime durations correlate positively with both generation and with the time of birth 
relative to fertilization. Fig. (d), illustrating a lack of homogeneity. Regarding independence, there 
are correlations throughout the tree. Fig. (e) demonstrates, for example, that lifetimes of siblings 
are positively correlated. These features are not in line with the assumptions under which some of the 
the sample path properties of the estimator were established, but are consistent with the derivation 
via properties of the cumulant generating function in Section]^ 

To test the accuracy of the estimator, we stochastically decorated the tree. Beginning with a sin¬ 
gle label-positive progenitor, daughters are delabeled with probability p. Should they remain label¬ 
positive, their daughters are delabeled independently with probability p and so on for the whole tree. 
Fig. i (a) illustrates one such random decoration, with blue indicating label positive and white label 
negative. 

For 10^ independent Monte Carlo realizations of this delabeling process. Fig. [^(f) shows the average 
(blue) and the inter-quartile range of the estimates (blue), using Q, at each time point during the 
first six hours of the development as well as the true average generation number of the embryo in 
orange, proving to be accurate. 

As explained in Section if not all cells are initially label-positive one can estimate average generation 
from proportion measurements at two distinct times. To mimic this, we considered the development 
of the labeled trees after ti = 150 minutes. At ti = 150 minutes one measurement is taken, giving the 
proportion of label-positive cells and the approximation eq. (§ to determine the average generation 
growth since ti to the time of a later measurement to determine the average generation since ti. The 
results are plotted in Fig. (g), which show accuracy. 
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6.2 Micro-satellite mutation reporter systems 

Existing micro-satellite reporter systems implement division-linked labeling and are suitable for both 
in vitro and in vivo applications [Ml [la EH 1301 in]- Micro-satellites are short repeating motifs found 
in DNA that, with small probability per cell division, are subject to insertion or deletion of copies of 
the motif. Mutation reporter systems possess an initially out of frame gene for a fluorescent protein. 
Micro-satellite mutation on division results in the gene becoming in frame, with the cell becoming 
fluorescent. This fluorescent state is inherited by offspring and is either irreversible, as in |31j . or 
has a reportedly negligible likelihood of reversion naEi]. Thus with micro-satellite mutation acting 
as a driver of rare division linked change, cell fluorescence serves as a label for average generation 
estimation. 

The two data sets we analyze are from studies published by Gasche et al. m on human colorectal 
cancer cells (HCT116) and Kozar et al. [33] on mouse embryonic fibroblast (MEF). The probability 
that a cell becomes fluorescent as a result of a micro-satellite mutation is estimated as 6.1 x 10“"^ 
and 1.1 X 10“^ in each study, respectively. For generation estimation, these values serve as p, the 
probability of delabeling per cell division. 

The data sets consist of measurements of proportions of labeled cells over several days from in vitro 
proliferating cell lines that carry transgenic micro-satellite mutation reporter cassettes, allowing us to 
estimate the average generation of the cell population. In addition to the proportions, |32j also provides 
the growth curve, which is exponential from day one to day four before slowing down, probably as a 
consequence of cell confluence [33] , while [TB] reports an exponential growth for the whole time-course 
of their experiment. 

Applying the estimator —l/plog{Z^{t)/Z{t)) with the proportions, Z'^{t)/Z{t), recapitulated in Fig.|^ 
(a-b), and label loss probabilities provided in the studies, we can estimate the average generation 
G{t)/Z(t) as a function of time. Fig. (c). For both data sets, the estimated average generation 
increases linearly over time at a rate of approximately one per day. 

Direct validation of the method with this data would require knowledge of the average generation, 
which is not reported in the above studies. The growth curve Fig. [^(d), however, allows an alternate 
method to estimate this quantity for comparison. We directly estimate the average from this curve by 
assuming for exponentially distributed life-times and the absence of cell death as in this special case 
one can show that E{G{t))/E{Z{t)) = 2/it, where p is the Malthus parameter [331 [39|. Estimating p 
from the growth curve from day one up to day four, using linear regression on log{Z(t)), predicts that 
the increase of average generation relative to the first time point as 1.06 t. This predictions matches 
closely the estimated average generation, Fig.|^(c). Taken together, the analysis of these data shows 
that inferring the average generation from the proportion of labeled cells carrying micro-satellite 
mutation reporter cassettes is an experimentally feasible approach. 


7 Experiment Design 

As an illustrative example of an ideal experiment including estimation of the probability of label-loss 
p, we describe one possible implementation of a division-linked one-way labeling system via a genetic 
construct. Fig. (a), that combines existing experimental techniques: expression of a fluorescent 
protein such as Blue Fluorescent Protein (BFP); the use of a cell-cycle specific promoter to create 
division-linked changes jug]; and site-specific recombination |44] . 
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The construct is composed of two elements. First CRE recombinase expression is placed under the 
control of a cell-cycle specific promoter [5il U. The cell-cycle promoter is employed to ensure a 
division linked expression of CRE recombinase. Then, two LoxP sites are placed at each end of the 
BFP gene. As cells enter cell cycle CRE recombinase is expressed, site-specific recombination will 
occur probabilistically between the two LoxP recombination sites [H] and the BEP gene is excised. 

Eor average generation inference, the desirable likelihood of recombination should be small, which can 
be achieved with a low efficiency CRE recombinase. Thus BEP-I- cells are regarded as label positive 
and BFP- cells are label negative. This design is similar in spirit to an existing one used to create 
mosaics in Zebrafish m, which employs micro-satellites as a division linked probabilistic driver and 
a kaloop in lieu of site-specific recombination. 

The proportion of label-positive cells in a cell system incorporating this construct can be readily deter¬ 
mined via fluorescence-activated cell sorting (FACS). The essential remaining ingredient for average 
generation inference is the determination of the probability of label-loss. This can be achieved by 
a in vitro FACS experiment, as illustrated by the in silico simulated experiment in Fig. 0(b)-(c). 
A collection of cells that incorporate the construct are stained with a division diluting dye, such as 
CFSE [33], with noise added for illustrative purposes. After division, cells are gated based on their 
generation, and the proportion of label-positive cells per-generation determined by their fluorescence. 
With BFP'''(n) being the measured proportion of label-positive cells in generation n, in the presence of 
a large number of initial progenitors the probability label-loss, p, irrespective of whether deaths occur, 
is 1 — BFP’^(n -I- l)/BFP’^(n), which should be the same for every n. In particular, if all initial cells 
are BFP-I-, then BFP^(O) = 1 andp = 1 —BFP~''(1) = BFP~(1), the proportion of BFP negative cells 
in generation 1. Thus the probability of label-loss can be readily determined experimentally. For this 
simulation, an estimated value of p is determined using this measurement for n = 0, p = BFP~(1). 

The accuracy of the method can then be checked during or after the experiment by comparing the 
average generation as determined by the division diluting dye with the value estimated via equation 
0 with the measured p, Fig. (d). Once p is known, similar cell stain experiments with distinct 
cell lines or distinct stimulii can be performed to validate the method’s inference before its use in 
vivo and for generation counts beyond the range possible with division diluting dyes or fluorescence 
microscopy. 

The construct described in this section is intended as an exemplar, creatable with present technology. 
Further experimental possibilities, including those based on naturally occurring mutations, can be 
found in the following Discussion. 


8 Discussion 

The estimators introduced in this article, eq. Q and eq. ([^, offer means to estimate a biologically 
significant quantity, the average generation of cell populations, for cell systems where direct mea¬ 
surement poses a significant challenge. Their appropriateness is not intuitively apparent and their 
development requires several mathematical results. Throughout the development of those results we 
assume that the label change is one-way, which can be shown not to be essential for the cumulant gen¬ 
erating function derivation of the estimators, but is relied on in establishing the sample-path results 
as if the label change is reversible, so that labels can be gained as well as lost, then asymptotically it 
is known that the growth rate of the number of label-positive cells equals the growth rate of the total 
number of cells [48] . 
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Despite its less than obvious genesis, the estimator has highly desirable features: it allows the popula¬ 
tion to be subject to death and division; it does not need to know the number of progenitors; it does 
not require knowledge of cell-cycle distributions; and, subject to knowing the per-division label-change 
probability, only requires the measurement of proportions. Comparison of estimates for simulations 
suggest that the estimator is accurate for physiological reasonable parameterizations. 

The estimators Q and ([^ depend on the proportion of label-positive cells in the population, where 
the label can be any of the cell’s properties that is lost with a small probability during its division 
cycle. Several naturally occurring and engineered cellular processes exist that approximately match 
this requirement. Somatic mutation is probably the best studied natural process of this kind. In this 
case germ-line cells are defined to be label-positive, while label-negative cells correspond to those with 
mutations in their DNA. The proportion of label-positive cells can be assessed by next generation 
sequencing. In humans there is a high rate of fidelity in DNA replication, giving an error rate 
of 5 X 10“^^ per base per division El and so using only a single nucleotide location as a label 
appears impractical. Considering instead the full germ-line genome as a positive label, even though 
each nucleotide may suffer mutations with distinct rates, and potentially in concert, there is still an 
over-all probability of label-change that has been estimated in human sperm cells as approximately 
p = 4 X 10’ -3 [m. In principle, this rate could be used as the probability of delabeling for our 
estimator. Alternatively, hotspots of mutation could also be analyzed. For example, micro-satellites, 
short repetitive sequences of DNA, mutate in humans with p in the range (I0“^, 10“3)j which we have 
used for the parameterization to evaluate the performance of our estimator. 

The use of hotspots of mutation as a label suffers from the difficulty that delabeled cells could relabel 
due to the occurrence of further mutations. The impact of this can be ameliorated by creating an 
asymmetry in the likelihood of relabeling to delabeling. One selects a set of N sites for measurement, 
each having their own probability of mutation, pi. One defines the label-positive cells as those that 
have germ-line values at all sites. The likelihood of a labeled cell delabeling is the likelihood that any 
of the sites mutate away from germ-line, p = i-nL(i ~Pi) ~ Pij if th® individual pi are small. 
The likelihood that a cell with a single mutation relabels is Pi p, creating a significant asymmetry. 
Moreover, for a cell that has had several mutations, the likelihood that all revert, which is necessary 
for relabeling, is smaller still. 

While it seems likely that one must rely on naturally occurring processes to estimate the average gen¬ 
eration of a cell population for humans, genetically modified cell lines and animals have the advantage 
that measuring the proportion of label-positive cells can be directly facilitated by fluorescent markers. 
The construct described in the Experiment Design section could be built or existing constructions, 
such as Cre-mediated sister chromatid recombination [6511361173] and micro-satellite mutation induced 
label activation [59l|30], can be adapted, so that reversion does not occur. Using fluorescence-activated 
cell sorting or live imaging technologies and the estimators developed in this article it could be possible 
to follow the average generation of specific cell populations, like neoplastic tissue or the skin, in living 
organisms over long periods of time. 

Recent evidence from in vivo cell lineage tracing techniques such as Cellular Barcoding m has 
demonstrated that apparently homogeneous progenitors give rise to heterogeneous families in cancer, 
immunology and hematopoesis, e.g. [57113111113115]. Theoretical work developing methodologies to 
interrogate data from these experiments is ongoing, but one key difficulty that must be overcome is 
that the generation of individual families is unknown |47| . Combining any of the above experimental 
methodologies and using the estimator developed in the present article would allow inference of the per 
progenitor average generation, enhancing the deductive power of these techniques, which have already 
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proved invaluable. Having established the estimators and validating them through simulations and 
comparison with published data, we believe these are promising avenues. 
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Figure 1: Single progenitor simulation validation. Simulations, described in detail in Section 
start with a single label-positive cell at time 0. Plots shown for two label-change probabilities, 
p = 10“^ (top row) and p = 10“^ (bottom row). For 15 independent realizations, orange lines display 
the actual average generation, G{t)jZ{t). Blue lines display the estimates —l/p\og{Z~^{t)/Z{t)) from 
([^, which ultimately increase linearly with the same slope as the average generation, but initially 
show significant variability, (c)-(d) For 100 independent realizations of the process, the dashed line 
reveals that the empirical average of the estimates is close to the true value. The blue region is 
an inter-quartile plot of the —l/p\og{Z~^(t)/Z{t)) estimates, with the upper boundary being where 
25 of the realizations are larger and the lower boundary being where 25 of realizations are smaller, 
which typically exhibits under-estimation, (e)-(f) Using Monte Carlo methods and a standard kernel 
estimator, these panels show, for t 2 = 250 hours, the density of observations of G{t 2 )IZ{t 2 ) in orange, 
the single time-point estimate —l/p\og{Z~^{t 2 )/Z{t 2 )) in blue, and the two time-point estimate (H 
in grey with an additional time point at U = 200 hours. The additional time-point improves the 
estimate by removing early variability on a realization-by-realization basis. 


28 




















(a) 


(c) 


p = 0.01 



(b) 

p = 0.001 



p = 0.01 



(d) 

p = 0.001 



Figure 2: Multiple progenitor in silico validation. Plots shown for two label-change proba¬ 
bilities, p = 10“^ (top row) and p = 10“^ (bottom row), (a)-(b) Sample paths of 15 independent 
realizations with 100 progenitors. Orange lines correspond to average generation, G(t)/Z{t). Blue 
lines display the estimates —l/p\og{Z^{t)/Z{t)), whose early time fluctuation is significantly less 
pronounced than with a single progenitor, as in Fig. (a)-(b). (c)-(d) For 1, 10 and 100 progenitors, 
Monte Carlo generated box plots of average generation (orange) and single time-point estimates (blue) 
at t = 250 hours, and two time-point estimates (grey), with the additional measurement at ti = 200 
hours. As predicted by theory, accuracy improves substantially, in a p dependent fashion, as the 
number of progenitors increases. 
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Figure 3: Similar results to those shown in Fig.[^ but using a cell life-time distributed according to a 
delayed Gamma distribution with shape parameter of 3, scale parameter of 1 and a delay of 7 hours. 


(a) 


(c) 



(b) 

p = 0.001 



(d) 

p = 0.001 

50 H 


40- 



0 ^ __ 

1 10 100 
# progenitors 


Figure 4: Similar results to those shown in Fig.[^ but using a cell life-time distributed according to a 
delayed Gamma distribution with shape parameter of 3, scale parameter of 1 and a delay of 7 hours. 
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Figure 5: Validation with data from time-lapse microscopy. C. elegans embryonic development 
tree taken from three-dimensional time-lapse movies of embryos recorded over approximately six hours, 
supplementary material of [S2], and decorated with a stochastic labeling with p = 0.05. (a) Lineage 
tree of C. elegans embryos, for first nine generations. The lowest node represents the progenitor 
present in the first frame, while the uppermost nodes represent cells alive in the last frame. Each 
node in between represents a division event, and its vertical position corresponds to the frame in which 
it was recorded. The color of the nodes, blue for label-positive at division and white for lab el-negative 
at division, illustrate one possible random delabeling of the trees, (b) Total population, Z{t), as a 
function of time, t, shows a step-wise increase in cell numbers followed by a reduction in growth rate 
that is not consistent with a homogeneous age dependent branching process, (c) Histogram of the 
cell lifetime shows a multi-modal distribution, partly explained by the fact that lifetimes appear to 
lengthen with time of birth or generation, (d) Box plot of the lifetimes as a function of generation, 
which shows lifetimes increasing with generation, (e) Sibling lifetimes are highly correlated, with 
many dividing during the same frame, (f) Comparison of average generation G(t)/Z(t) (orange) with 
its estimate — log {Z~^{t)/Z{t)) /p (blue). Monte Carlo determined inter-quartile ranges based on 10"* 
samples, (g) Same as (f), but starting at ti = 150 and computing the difference of the average 
generation G(t)/Zlt) (orange) since ti via the approximation ([^. 
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Figure 6: Validation with data from micro-satellite reporter systems, (a-b) Percentage of 
unlabeled cells Z+(t)/Z(t) over several days of culture for genetically modified clones of MEF cells (a, 
IM2) ') and HCT116 cells (b, [T^) carrying micro-satellite mutation reporter cassettes. The mutation 
probabilities, as reported in the respective studies are 1.1 x 10“'* [32] and 6.1 x 10’ m, which we use 
for p, the probability of delabeling per cell division. Data are represented as mean ± SEM. Eor the 
HCT116 data, the mean at each time point is over three clones, (c) Inferred relative average generation 
G(t)/Z(t) from the proportions of unlabeled cells andp using the estimator l/p\og{Z~^{t)/Z{t)) (MEE, 
open circles; HCT116, filled circles). Eor both data sets the average generation increases linearly with 
time, as expected from the theory. In addition, the slope is close to one generation per day, even 
thought the mutation probability p differs by a factor of six. The orange dashed line shows the 
increase in average generation as estimated directly from the growth curve Z(t) (panel d) assuming 
exponentially distributed life-times and no cell death (for details see main text), demonstrating the 
accuracy of the estimator. The blue dashed regression lines (r^ = 0.99 for both sets) highlight the 
linear increase of average generation estimates during the exponential growth phase, (d) Population 
size Z{t) of cultured MEF cells (open circles, [32]) and exponential growth fit to first four time points 
(dashed line, = 0.97). 
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Figure 7: Proposed probabilistic division-linked label-loss construct, (a) Schematic drawing 
of the construct. Cells initially express a protein such as Blue Fluorescent Protein (BFP). During 
each cell cycle Cre is expressed and with a small probability the gene for BFP is excised, leading 
to irreversible label-loss, (b)-(d) In silico simulation of proposed FACS experiment to determine the 
probability of label-loss, p. The simulation is parameterized as in Section 20, 000 BFP-I- cells are 
labeled with a division diluting dye such as CFSE, which is green, and cultured until they divide. At 
each division, CFSE is shared evenly between daughters and the BFP label is lost with probability 
p = 0.01. (b)-(c) Simulated FACS plots, ten and 40 hours after CFSE labeling. The fluorescence 
signals were set to realistic experimental values [53] and noise was added to initial CFSE levels for 
illustrative purposes. The gates in (b) and (c) capture the proportion of label-positive cells, BFP'^(n), 
in each generation, n. From these, the label-loss probability, p, can be experimentally determined via 
the formula p = 1 — BFP'*'(n -|- l)/BFP’^(n). As all cells in the zeroth generation express BFP, 
BFP~'’(0) = 1, and so p = 1 — BFP~'’(1) = BFP~(1) = 0.0101, the proportion of BFP negative cells 
in generation 1. (d) The orange line plots the average generation of the population, determined from 
the simulated CFSE data, a function of time. The blue line is estimate determined using the method 
described in the present paper Q with a label-loss parameter, p = 0.0101, obtained from (b). The 
CFSE determined value and the estimate exhibit excellent concordance. 


33 




















